Add FileService as a standalone microservice, LakeFS+S3 as dataset storage #3296

Open · bobbai00 wants to merge 47 commits into master from jiadong-add-file-service
Conversation

@bobbai00 (Collaborator) commented Mar 2, 2025

This PR introduces the FileService as a new microservice alongside WorkflowCompilingService, ComputingUnitMaster/Worker, and TexeraWebApplication.

Purpose of the FileService

  • We want to improve the performance of our current Git-based dataset implementation.
  • We decided to go with LakeFS + S3: LakeFS handles the version-control metadata, and S3 handles the data transfer. However, LakeFS does not provide an access-control layer.
  • Therefore, we built the FileService, which provides:
    • all the APIs related to versioned files in datasets
    • access control

Architecture before and after adding FileService

Before: (architecture diagram screenshot)

After: (architecture diagram screenshot)

Key Changes

  • A new service, FileService, is introduced. All dataset-related endpoints are now hosted on FileService.
  • Several configuration items related to LakeFS and S3 are introduced in storage-config.yaml.
  • The frontend UI is updated to accommodate these changes.
  • ComputingUnitMaster and ComputingUnitWorker call FileService to read files, and their access is verified during those calls. In the dynamic computing architecture (which will be introduced in Add computing unit manager service #3298), they send requests along with the current user's token; in the single-machine architecture, they bypass the network entirely and make direct local function calls.
  • Python UDFs can now read a dataset's file directly, for example:

  file = DatasetFileDocument("The URL of the file")
  data = file.read_file()  # returns an io.BytesIO object

You may refer to core/amber/src/main/python/pytexera/storage/dataset_file_document.py for implementation details. This feature is only available in the dynamic computing architecture.
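Since file reads in the dynamic computing architecture carry the current user's token for access verification, the request shape can be sketched as follows. This is a hypothetical illustration, not the PR's actual implementation: the URL, port, and header name are assumptions.

```python
import urllib.request


def build_dataset_file_request(file_url: str, user_token: str) -> urllib.request.Request:
    """Build an HTTP request for a dataset file, attaching the current
    user's token so FileService can verify dataset access before
    serving the bytes. (Hypothetical sketch; the real endpoint and
    header used by FileService may differ.)"""
    req = urllib.request.Request(file_url)
    req.add_header("Authorization", f"Bearer {user_token}")
    return req


# Example usage (hypothetical URL):
req = build_dataset_file_request(
    "http://localhost:9092/api/dataset/file?path=my-dataset/data.csv",
    "current-user-token",
)
```

In the single-machine architecture no such request is built; the same read is a direct local function call, so the token round-trip is skipped.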

How to migrate previous datasets to the new datasets managed by LakeFS

As we did quite a bit of refactoring, the two dataset implementations are not compatible with each other. To migrate previous datasets to the latest implementation, you will need to re-upload the data via the new UI.

How to deploy new architecture

Step 1. Deploy LakeFS & MinIO

Use Docker (highly recommended for local development)

  • Go to the directory core/file-service/src/main/resources
  • Execute docker-compose --profile local-lakefs up -d in that directory

Use Binary (Recommended for production deployment)

Refer to https://docs.lakefs.io/howto/deploy/

Step 2. Configure storage-config.yaml

Configure the section below in storage-config.yaml:

  lakefs:
    endpoint: ""
    auth:
      api-secret: ""
      username: ""
      password: ""
    block-storage:
      type: ""
      bucket-name: ""

  s3:
    endpoint: ""
    auth:
      username: ""
      password: ""

If you used core/file-service/src/main/resources/docker-compose.yml to install LakeFS & MinIO, you can use the following configuration directly:

  lakefs:
    endpoint: "http://127.0.0.1:8000/api/v1"
    auth:
      api-secret: "random_string_for_lakefs"
      username: "AKIAIOSFOLKFSSAMPLES"
      password: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
    block-storage:
      type: "s3"
      bucket-name: "texera-dataset"

  s3:
    endpoint: "http://localhost:9000"
    auth:
      username: "texera_minio"
      password: "password"
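A quick way to catch a misconfigured storage-config.yaml before launching the services is to check that every field in the section above is filled in. This is a hypothetical helper, not part of the PR; the field names simply mirror the YAML keys shown above.

```python
def validate_storage_config(cfg: dict) -> list:
    """Return the dotted paths of required storage-config fields that
    are missing or empty. (Hypothetical sketch; field names mirror the
    storage-config.yaml section above.)"""
    required = [
        ("lakefs", "endpoint"),
        ("lakefs", "auth", "api-secret"),
        ("lakefs", "auth", "username"),
        ("lakefs", "auth", "password"),
        ("lakefs", "block-storage", "type"),
        ("lakefs", "block-storage", "bucket-name"),
        ("s3", "endpoint"),
        ("s3", "auth", "username"),
        ("s3", "auth", "password"),
    ]
    missing = []
    for path in required:
        node = cfg
        for key in path:
            # Flag the field if an intermediate dict is absent or the value is empty.
            if not isinstance(node, dict) or key not in node or node[key] in ("", None):
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

Running this against the parsed YAML (e.g. loaded with PyYAML) returns an empty list for the sample configuration above.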

Step 3. Launch services

Launch FileService in addition to TexeraWebApplication, WorkflowCompilingService, and ComputingUnitMaster.

Future PRs after this one

  • Remove the dataset-related endpoints completely from the amber package.
  • Incorporate the deployment of LakeFS + S3 into the Helm chart for the K8s-based deployment.
  • Some optimizations:
    • for small files, upload them directly instead of using multipart upload
    • when exporting results, use multipart upload, as result sizes can be quite large
    • support resuming transmission of partially uploaded files
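The small-file vs. multipart optimization above amounts to a size-based upload policy. The sketch below illustrates the idea only: the function name is hypothetical, and the 5 MiB threshold is an assumption (it matches S3's minimum multipart part size, but the actual cutoff chosen in a future PR may differ).

```python
# S3 requires multipart parts (except the last) to be at least 5 MiB;
# we assume that as the direct-upload cutoff for this illustration.
MULTIPART_THRESHOLD = 5 * 1024 * 1024  # bytes


def plan_upload(data: bytes, part_size: int = MULTIPART_THRESHOLD):
    """Return ("direct", [data]) for small payloads, or
    ("multipart", parts) where each part is at most part_size bytes.
    (Hypothetical helper sketching the planned optimization.)"""
    if len(data) <= part_size:
        # Small file: a single request avoids multipart bookkeeping overhead.
        return "direct", [data]
    # Large payload (e.g. a result export): split into fixed-size chunks.
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    return "multipart", parts
```

Resuming a partially uploaded file would then reduce to re-sending only the parts whose uploads were not acknowledged.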

@bobbai00 bobbai00 changed the title Add FileService as a standalone microservice, and LakeFS+S3 as dataset storage Add FileService as a standalone microservice, LakeFS+S3 as dataset storage Mar 2, 2025
@bobbai00 bobbai00 marked this pull request as ready for review March 2, 2025 22:16
@bobbai00 bobbai00 self-assigned this Mar 3, 2025
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 5db607e to 0e7a9d8 Compare March 3, 2025 18:39
@aglinxinyuan (Collaborator) left a comment:

Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 6d101a1 to da96a2a Compare March 4, 2025 21:26
@bobbai00 (Collaborator, Author) commented Mar 4, 2025

> Since installing LakeFS and Minio can be complex, could we add a frontend flag that allows developers to enable the user system without requiring LakeFS and Minio? This would let developers read files directly from their local file system when the user system is enabled.

OK. I have added a flag in environment.default.ts

@aglinxinyuan (Collaborator) left a comment:

LGTM!
Tested on both Windows and Mac; the setup is very smooth.
Please add more details to Step 3, for example, how developers can migrate to this from the current master.

@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch 2 times, most recently from 07d4789 to 2e38fad Compare March 9, 2025 21:53
@bobbai00 bobbai00 force-pushed the jiadong-add-file-service branch from 2e38fad to 6accd5b Compare March 10, 2025 05:20